Abstract

We present and analyze a novel regularization technique based on
enhancing our dataset with corrupted copies of the original data. The
motivation is that, since the learning algorithm lacks information
about which parts of the data are reliable, it has to produce more
robust classification functions. Using this framework, we propose a
simple addition to the gentle boosting algorithm which enables it to
work with only a few examples. We test this new algorithm on a variety
of datasets and show convincing results.

1.  Introduction 

Boosting, the iterative combination of classifiers to build a strong
classifier, is a popular learning technique. The algorithms based on
it, such as AdaBoost and gentleBoost, are easy to implement, work
reasonably fast, and in general produce classifiers with good
generalization properties for large enough datasets. If the dataset is
not large enough and there are many features, these algorithms tend to
overfit and perform much worse than the popular Support Vector Machine
(SVM) algorithm. SVM can be successfully applied to all datasets, from
small to very large.

The major drawback of SVM is that, at run time, when classifying a new
example x, it uses all the measurements (features) of x. This poses a
problem because, while we would like to cover many promising features
during training, computing all the features at run time might be too
costly. This is especially true for object detection problems in
vision, where we often need to search the whole image in several
scales over thousands of possible locations, each location producing
one such vector x. While several approaches for combining feature
selection with SVM have been suggested in the past (e.g., [14]), they
are rarely used.

The complexity of the feature vector can be controlled more easily by
using boosting techniques over weak classifiers based on single
features (e.g., regression stumps), such as in the highly successful
system of [13]. In this case, the number of features used is bounded
by the number of iterations in the boosting process. However, since
boosting tends to overfit on small datasets, there is a bit of a
dilemma here. An ideal algorithm would enable good control over the
total number of features, while being able to learn from only a few
examples. Such an algorithm is presented in Sec. 4.

This new algorithm is based on gentleBoost. Within it we implemented a
regularization technique based on a simple idea: add corrupted copies
of the training dataset to the original one, and the algorithm will
not be able to overfit. A general background on fitting and
regularization is given in the next section.

2.  Background 

We are given a set of $n$ examples $\{z_i = (x_i, y_i)\}_{i=1}^{n}$,
$x \in \mathcal{X}$, $y \in \mathcal{Y}$, drawn from a joint
distribution $\mathcal{P}$ on $\mathcal{X} \times \mathcal{Y}$. The
ultimate goal of the learning algorithm is to produce a function
$f : \mathcal{X} \to \mathcal{Y}$ such that the expected error of $f$,
given by the expression $E_{(x,y) \sim \mathcal{P}}(f(x) \neq y)$, is
minimized. The boolean expression inside the parentheses evaluates to
one if it holds, zero otherwise.

Since we do not know the distribution $\mathcal{P}$, we are tempted to
minimize the empirical error given by
$\sum_{i=1}^{n}(f(x_i) \neq y_i)$. The problem is that if the space of
functions from which the learning algorithm selects $f$ is too large,
we are at risk of overfitting (learning to deal only with the training
error). In that case, while the empirical error is small, the expected
error is large. In other words, the generalization error (the
difference of empirical error from expected error) is large.
Overfitting can be avoided by using any one of several regularization
techniques.

Overfitting is usually the result of allowing too much freedom in the
selection of the function $f$. Thus, the most basic regularization
technique is to limit the number of free parameters we use while
fitting the function $f$. For example, in binary classification we may
limit ourselves to learning functions of the form
$f(x) = (h^\top x > 0)$ (we assume $\mathcal{X} = \Re^n$; $h$ is a
vector of free parameters). Using such functions, we reduce the risk
of overfitting, but may never optimally learn a target function (i.e.,
the "true function" $f(x) = y$ that is behind the distribution
$\mathcal{P}$) of other forms, e.g., we will not be able to learn
$f(x) = (x(1)^2 - x(2) > 0)$.
Another regularization technique is to minimize the empirical error
subject to constraints on the learned functions. For example, we can
require that the norm of the vector of free parameters $h$ be less
than one. A related but different regularization technique is to
minimize the empirical error together with a penalty term on the
complexity of the function we fit. The most popular penalty term,
Tikhonov regularization, has a quadratic form. Using the linear model
above, an appropriate penalty function would be $\|h\|^2$, and we
would minimize $\sum_{i=1}^{n}((h^\top x_i > 0) \neq y_i) + \|h\|^2$.

Sometimes, adding a regularization term to the optimization problem
solved by the algorithm is not trivial. In the most extreme case, the
algorithm is a black box we cannot alter at all. Still, a simple form
of regularization called noise injection can be employed. In the noise
injection technique, the training dataset is enriched by multiple
copies of each training data point $x_i$. Zero-mean, low-variance
Gaussian noise (independent for each coordinate) is added to each
copy, and the original label $y_i$ is preserved. The motivation is
that if two data points $x, x'$ are close (i.e., $\|x - x'\|$ is
small), we would like $f(x)$ and $f(x')$ to have similar values. By
introducing many examples with similar $x$ values and identical $y$
values, we teach the classifier to have this stability property.
Hence, the learned function is encouraged to be smooth (at least
around the training points).
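As a minimal sketch of the noise injection technique just described
(not taken from any of the cited works), the function below appends
noisy replicas of every training point while preserving its label. The
function name `inject_noise` and the parameters `copies` and `sigma`
are illustrative assumptions; the text does not prescribe how many
copies or how much variance to use.

```python
import random

def inject_noise(X, y, copies=5, sigma=0.05, seed=0):
    """Return the original data plus `copies` noisy replicas of each point.

    X is a list of feature lists, y the list of labels. `copies` and
    `sigma` are assumed values chosen only for illustration.
    """
    rng = random.Random(seed)
    X_aug = [list(x) for x in X]
    y_aug = list(y)
    for x, label in zip(X, y):
        for _ in range(copies):
            # Independent zero-mean Gaussian noise on every coordinate.
            X_aug.append([v + rng.gauss(0.0, sigma) for v in x])
            # The original label is kept for the noisy copy.
            y_aug.append(label)
    return X_aug, y_aug
```

Because the augmentation only touches the data, it can be placed in
front of a learning algorithm that is treated as a black box.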

The study of the noise injection technique, which blossomed in the
mid-1990s, established the following results: (1) It is an effective
way to reduce generalization error. (2) Its effect is similar to
shrinkage (the statistical term for regularization) of the parameters
in some simple models (e.g., [2]). (3) It is equivalent to Tikhonov
regularization [1]. Note that this does not mean that we can always
use Tikhonov regularization instead of noise injection, as for some
learning algorithms it is not possible to create a regularized
version.

The technique we introduce next is similar in spirit to noise
injection. However, it is different enough that the results obtained
for noise injection will not hold for it. For example, the results of
[1] use a Taylor expansion around the original data points. Such an
approximation will not hold for our new technique, since the "noise"
is too large (i.e., the new data point is too different). Other
important properties that might not hold are the independence of noise
across coordinates, and the zero mean of the noise.

Our regularization technique is based on creating corrupted copies of
the dataset. Each new data point is a copy of one original training
point, picked at random, where one random coordinate (feature) is
replaced with a different value, usually the value of the same
coordinate in another random training example. The basic procedure
used to generate the new example is illustrated in Fig. 1. We call it
the feature knock out (KO) procedure, since one feature value is being
altered dramatically. It is repeated many times to create new
examples. It can be used with any learning algorithm, and we use it in
the analysis presented in Sec. 3.


Input: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in \Re^n$, $y_i \in Y$.
Output: one synthesized pair $(\hat{x}, \hat{y})$.

1. Select two examples $x_a, x_b$ at random.

2. Select a random feature $k \in [1..n]$.

3. Set $\hat{x} \leftarrow x_a$ and $\hat{y} \leftarrow y_a$.

4. Replace feature $k$ of $\hat{x}$: $\hat{x}(k) \leftarrow x_b(k)$.


Figure 1: The Feature Knockout Procedure
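The four steps of Fig. 1 can be sketched in a few lines of Python.
This is an illustrative reading of the figure, not the authors' code:
the function name `feature_knockout` is made up, and the two examples
are assumed to be drawn uniformly with replacement (so they may
coincide, in which case the copy is unchanged), a detail the figure
leaves open.

```python
import random

def feature_knockout(X, y, rng=random):
    """Synthesize one example following the four steps of Fig. 1.

    X is a list of m feature lists of length n, y the list of m labels.
    """
    a = rng.randrange(len(X))      # step 1: pick two examples at random
    b = rng.randrange(len(X))      # (assumption: uniform, with replacement)
    k = rng.randrange(len(X[0]))   # step 2: pick a random feature index
    x_new = list(X[a])             # step 3: copy x_a and keep its label
    y_new = y[a]
    x_new[k] = X[b][k]             # step 4: replace feature k by x_b(k)
    return x_new, y_new
```

Calling the function repeatedly and appending the returned pairs to
the training set yields the enlarged dataset used in the analysis.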

However, since our application emphasis is on boosting, we use the
specialized version in Fig. 2.

The KO regularization technique is especially suited for use when
learning from only a few examples. The robustness we demand from the
selected classification function is much more than local smoothness
around the classification points (cf. noise injection). This kind of
smoothness is easy to achieve when example points are far from one
another. Our regularization, however, is less restrictive than
demanding uniform smoothness (Tikhonov) or requiring the reduction of
as many parameters as possible. Both of these approaches might not be
ideal when only a few examples are available, because there is nothing
to balance a large amount of uniform smoothness, and it is easy to fit
a model that uses very few parameters. Instead, we encourage
redundancy in the classifier since, in contrast to the shortage of
training examples, there is an abundance of features.

3.  Analysis 

The effect of adding noise to the training data depends on the
learning algorithm used, and is highly complex. Even for the case of
adding zero-mean, low-variance Gaussian noise (noise injection), this
effect was studied only for simple algorithms (e.g., [2]) or the
square loss function [1].

In Sec. 3.1 we study the effect of feature knock-out on the well-known
linear least squares regression problem. We show that it leads to a
scaled version of Tikhonov regularization. Compare this to Bishop's
result (using a Taylor expansion) that noise injection is equivalent
to Tikhonov regularization. Then, in Sec. 3.2, we analyze how feature
KO affects the variance of the learned classifier.

3.1.  Effect  of  feature  KO  on  linear  regression 

One of the most basic models we can apply to the data is the linear
model. In this model, the input examples $x_i \in \Re^n$, $i = 1..m$,
are organized as the columns of the matrix $A \in \Re^{n \times m}$;
the corresponding $y_i$ values are stacked in one vector
$y \in \Re^m$. The prediction made by the model is given by
$A^\top h$, where $h$ is the vector of free parameters we have to fit
to the data. In the common least squares case, $\|y - A^\top h\|^2$ is
minimized.

In the case that the matrix $A$ is full rank and overdetermined, it is
well known that the optimal solution is $h = A^+ y$, where
$A^+ = (AA^\top)^{-1}A$ is known as the pseudo inverse of the
transpose of $A$ (our definition of $A$ is the transpose of the common
textbook definition). If $A$ is not full rank, the matrix inverse
$(AA^\top)^{-1}$ is not well defined. However, as an operator on the
range of $A$ it is well defined, and the above expression still holds,
i.e., even if there is an ambiguity in selecting the inverse matrix,
there is no ambiguity in the operation of all possible matrices on the
range of the columns of $A$, which is what we care about.

Even so, if the covariance matrix $(AA^\top)$ has a large condition
number (i.e., it is close to being singular), small perturbations of
the data result in large changes to $h$, and the system is unstable.
The solution fits the data $A$ well, but does not fit data which is
very close to $A$; hence there is overfitting. To stabilize the
system, we apply regularization.

Tikhonov regularization is based on minimizing
$\|y - A^\top h\|^2 + \lambda \|h\|^2$. This is equivalent to using a
regularized pseudo inverse: $A^+_\lambda = (AA^\top + \lambda I)^{-1}A$,
where $I$ is the $n \times n$ identity matrix, and $\lambda$ is the
regularization parameter.
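The regularized pseudo inverse above can be tried out numerically. The
following NumPy sketch follows the convention of this section
(examples as the columns of $A$); the function name `tikhonov_solve`
and the argument name `lam` (standing for $\lambda$) are illustrative
choices, not part of the original text.

```python
import numpy as np

def tikhonov_solve(A, y, lam):
    """Return h minimizing ||y - A^T h||^2 + lam * ||h||^2.

    A has shape (n, m) with one example per column, y has shape (m,).
    With lam = 0 this reduces to the unregularized least squares
    solution, which requires A A^T to be invertible.
    """
    n = A.shape[0]
    # Solve (A A^T + lam I) h = A y rather than forming the inverse.
    return np.linalg.solve(A @ A.T + lam * np.eye(n), A @ y)
```

Setting the gradient of the objective to zero gives
$(AA^\top + \lambda I)h = Ay$, which is exactly the linear system
solved here. Increasing `lam` shrinks the norm of the returned $h$.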

In many applications, the linear system we need to solve is badly
scaled, e.g., one variable is much larger in magnitude than the other
variables. In order to rectify this, we may apply a transformation to
the data that weights each variable differently, or equivalently
weight the vector $h$ by applying a diagonal matrix $D$, such that $h$
becomes $\tilde{h} = Dh$.

Instead of solving the original system $A^\top h = y$, we now solve
the system $\tilde{A}^\top \tilde{h} = y$, where
$\tilde{A} = D^{-1}A$. Solving this system using Tikhonov
regularization is termed "scaled Tikhonov regularization." If $D$ is
unknown, a natural choice is the diagonal matrix with the entries
$D_{kk} = \sqrt{(AA^\top)_{kk}}$ [9]. We will now show that using the
knock-out procedure to add many new examples is equivalent to scaled
Tikhonov regularization, using the weight matrix above.

Lemma 1 When using the linear model with a least squares fit, applying
the knock out procedure in Fig. 1 to generate many examples is
equivalent to applying scaled Tikhonov regularization where
$D_{kk} = \sqrt{(AA^\top)_{kk}}$.

Proof: see [15].
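To make the scaled variant concrete, the sketch below solves the
rescaled system with the weight matrix of Lemma 1 and maps the
solution back to the original variables. It illustrates scaled
Tikhonov regularization itself and does not reproduce the proof in
[15]; the function name and the argument `lam` are assumptions, and
every feature is assumed to have a nonzero norm so that $D$ is
invertible.

```python
import numpy as np

def scaled_tikhonov_solve(A, y, lam):
    """Scaled Tikhonov regularization with D_kk = sqrt((A A^T)_kk).

    A has shape (n, m) with one example per column, y has shape (m,).
    Assumes no feature (row of A) is identically zero.
    """
    n = A.shape[0]
    d = np.sqrt(np.diag(A @ A.T))      # diagonal entries of D
    A_s = A / d[:, None]               # A~ = D^{-1} A
    # Ordinary Tikhonov regularization in the rescaled variables.
    h_s = np.linalg.solve(A_s @ A_s.T + lam * np.eye(n), A_s @ y)
    return h_s / d                     # h = D^{-1} h~
```

Substituting $\tilde{A} = D^{-1}A$ shows that the result equals
$(AA^\top + \lambda D^2)^{-1}Ay$, so each coefficient is penalized in
proportion to the squared norm of its feature. One consequence is that
multiplying a feature by a constant leaves the predictions unchanged.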

To get a better understanding of the way feature knock out works, we
study the behavior of scaled Tikhonov regularization. In the boosting
case, the knock out procedure is expected to produce solutions which
make use of more features. Are these models more complex? This is hard
to define in the general case, but easy to answer in the linear least
squares case study.

In linear models, the predictions $\hat{y}$ on the training data take
the form $\hat{y} = Py$. For example, in the unregularized pseudo
inverse case we have $\hat{y} = A^\top h = A^\top (AA^\top)^{-1} Ay$,
and therefore $P = A^\top (AA^\top)^{-1} A$. There is a simple measure
of complexity called the effective degrees of freedom [7], which is
just $\mathrm{Tr}(P)$ for linear models. A model with $P = I$ (the
identity matrix) has zero training error, but may overfit. In the full
rank case, the unregularized model has as many effective degrees of
freedom as the number of features ($\mathrm{Tr}(P) = n$).

Lemma 2 The linear model obtained using scaled Tikhonov regularization
has a lower effective degree of freedom than the linear model obtained
using unregularized least squares.

Proof: see [15].
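Lemma 2 can be checked numerically on a small example. The sketch
below computes $\mathrm{Tr}(P)$ for the unregularized, Tikhonov, and
scaled Tikhonov fits; it is an illustration only, and the function
name `effective_dof` and its arguments are hypothetical. For the
scaled case it uses the closed form
$P = A^\top (AA^\top + \lambda D^2)^{-1} A$ with
$D^2 = \mathrm{diag}(AA^\top)$, which follows from substituting
$\tilde{A} = D^{-1}A$ into the regularized pseudo inverse.

```python
import numpy as np

def effective_dof(A, lam=0.0, scaled=False):
    """Effective degrees of freedom Tr(P) of a linear least squares fit.

    A has shape (n, m) with one example per column. With lam = 0 the fit
    is unregularized; `scaled` selects D^2 = diag(A A^T) instead of I.
    """
    G = A @ A.T
    R = np.diag(np.diag(G)) if scaled else np.eye(A.shape[0])
    # P maps the training targets y to the training predictions.
    P = A.T @ np.linalg.solve(G + lam * R, A)
    return float(np.trace(P))
```

For a full rank, overdetermined $A$ the unregularized value equals the
number of features $n$, and any positive `lam` pushes it below $n$.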

Similar to the work done on noise injection, we examined the effect of
our procedure on a simple regression technique. We saw that feature
knock out resembles the effect of scaled Tikhonov regularization,
i.e., high norm features are penalized by the knock out procedure.
However, boosting over regression stumps seems to be scale invariant.
Multiplying all the values of a feature by some constant does not
change the resulting classifier, since the process that fits the
regression stumps (see Sec. 4) uses the values of each feature to
determine the thresholds that it uses. However, a closer look reveals
the connection between scaling and the effect of the knock out
procedure on boosting. Boosting over stumps (e.g., [13]) chooses at
each round one out of $n$ features, and one threshold for this
feature. The thresholds are picked from the $m$ possible values that
exist in between every two sorted feature values. The feature and the
threshold define a "weak classifier" (the basic building block of the
ensemble classifier built by the boosting procedure [10]), which
predicts -1 or +1 according to the threshold. Equivalently, we can say
that boosting over stumps chooses from a set of $nm$ binary features;
these features are exactly the values returned by the weak
classifiers. These $nm$ features have different norms, and are not
scale invariant. Let us call each such feature an nm-feature.

Using the intuitions of the linear least squares case, we would like
to inhibit features of high magnitude. All nm-features have the same
norm ($\sqrt{m}$), but different entropies (a measure which is highly
related to norm). These entropies depend only on the ratio of positive
values in each nm-feature; call this ratio $p$.

Creating new examples using the feature knock-out procedure does not
change the number of possible thresholds, and therefore the number of
features remains the same. The values of the new example in the $nm$
feature space will be the same for all features originating from the
$n - 1$ features that were not changed in the knockout procedure. The
value for a knocked-out feature (feature $k$ in Fig. 1) will change if
the new value is on the other side of the threshold as compared to the
old value. This will happen with probability $2p(1 - p)$. If this sign
flip happens, then the feature is inhibited because it gives two
different classifications to two examples with the same label (KO
leaves labels unchanged). Note that the entropy of a feature with a
positive ratio of $p$ and the probability $2p(1 - p)$ behave
similarly: both rise monotonically for $0 < p < 1/2$ and then drop
symmetrically. Hence, we obtain the following result:

Lemma 3 Let $t$ be a single nm-feature created by combining a single
input feature with a threshold. The amount of inhibition $t$
undergoes, as the result of applying feature knockout, grows
monotonically with the entropy of $p$.
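The claim that the flip probability $2p(1 - p)$ and the entropy of $p$
move together can be inspected directly. The sketch below tabulates
both on a grid; it assumes the entropy in question is the binary
Shannon entropy (in bits), and that the old and the new feature value
fall on the positive side of the threshold independently with
probability $p$, which gives $p(1 - p) + (1 - p)p = 2p(1 - p)$. The
function and variable names are illustrative.

```python
import math

def flip_probability(p):
    """Probability that a knocked-out value lands on the other side of
    the threshold, assuming both values are positive with probability p."""
    return 2.0 * p * (1.0 - p)

def binary_entropy(p):
    """Binary Shannon entropy in bits (assumed to be the entropy meant)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

# Evaluate both quantities on a grid p = 0, 0.05, ..., 1.
ps = [i / 20 for i in range(21)]
flips = [flip_probability(p) for p in ps]
ents = [binary_entropy(p) for p in ps]
```

Both lists increase up to $p = 1/2$ and are symmetric around it, so
ranking nm-features by entropy or by flip probability gives the same
order.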

Hence, similarly to the scaling in the linear case, the knock out
procedure inhibits high magnitude features (here the magnitude is
measured by the entropy). Note that in the algorithm presented in
Sec. 4, a feature is used for knock-out only after it was selected to
be a part of the output classifier. Still, KO inhibits weak
classifiers based on features with higher entropies more strongly,
making them less likely to get picked again. It is possible to perform
this higher-entropy preferential inhibition directly on all features,
thereby simulating the full knock-out procedure. The implementation of
this is left for future experiments.

3.2.  Bias/variance  decompositions 

Many training algorithms can be interpreted as trying to minimize a
cost function of the form $\sum_{i=1}^{m} L(f(x_i), y_i)$, where $L$
is a loss function. For example, in the 0/1 loss function
$L(f(x), y) = (f(x) \neq y)$, we pay 1 if the labels are different, 0
otherwise. By applying the knock-out procedure to generate more
training data, an algorithm that minimizes such a cost function will
actually minimize
$\sum_{i=1}^{m} E_{\hat{x} \sim C_X(x_i)} L(f(\hat{x}), y_i)$, where
$C_X(x)$ represents the distribution of all knocked-out examples
created from $x$.

Consider a bias-variance decomposition based on the 0/1 loss function,
as analyzed in [3]. We follow the terminology of [3] with a somewhat
different derivation, and for the presentation below we include a
simplified version. Assume for simplicity that each training example
occurs in our dataset with only one label, i.e., if $x_i = x_j$ then
$y_i = y_j$. Define the optimal prediction $f^*$ to be the "true"
label $f^*(x_i) = y_i$. Define the main prediction of a function $f$
to be just the prediction $f(x)$. The bias is defined to be the loss
between the optimal and main predictions:
$B(x) = (f(x) \neq f^*(x))$. The variance $V(x)$ is defined to be the
expected loss of the prediction with regard to the main prediction:
$V(x) = E_{\hat{x} \sim C_X(x)}(f(\hat{x}) \neq f(x))$. These
definitions allow us to present the following observation:

Observation 1 Let $B_0$ be the set of all training example indices for
which the bias $B(x_i)$ is zero (the unbiased set). Let $B_1$ be the
set for which $B(x_i) = 1$ (the biased set). Then,
$\sum_{i=1}^{m} E_{\hat{x} \sim C_X(x_i)}(f(\hat{x}) \neq y_i) =
\sum_{i=1}^{m} B(x_i) + \sum_{i \in B_0} V(x_i) - \sum_{i \in B_1} V(x_i)$.

In the unbiased case ($B(x) = 0$), the variance $V(x)$ increases the
training error. In the biased case ($B(x) = 1$), the variance at point
$x$ decreases the error. A function $f$, which minimizes the training
cost function that was obtained using feature knock-out, has to deal
with these two types of variance directly while training. Define the
net variance to be the difference of the biased variance from the
unbiased; a function trained using the feature knock-out procedure is
then expected to have a higher net variance than a function trained
without this procedure. If we assume our corruption process $C_X$ is a
reasonable model of the robustness expected from our classifier, a
good classifier would have a high net variance on the testing data.
The net variance measured in our experiments [15] shows the effect of
the feature knockout approach.

4.  The  gentleBoostKO  algorithm 

While our regularization procedure can be applied, in principle, to
any learning algorithm, using it directly when the number of features
$n$ is high might be computationally demanding. This is because for
each one of the $m$ training examples, as many as $n(m - 1)$ new
examples can be created. Covering even a small portion of this space
might require the creation of many synthesized examples.

However, for some algorithms our regularization technique can be
applied with very little overhead. For boosting over regression
stumps, it is sufficient to modify those features that participate in
the trained ensemble (i.e., those features that actually participate
in the classification).

The basic algorithm used in our experiments is specified in Fig. 2. It
is a modified version of the gentleBoost algorithm [6]. gentleBoost
seems to converge faster than AdaBoost, and performs better for object
detection problems [12]. At each boosting round, a regression function
is fitted (by weighted least squared error) to each feature in the
training set. We used linear regression for our experiments, fitting
parameters $a$, $b$ and $th$ so that our regression functions are of
the form $f(x) = a(x > th) + b$. The regression function with the
least weighted squared error is added to the total classifier $H(x)$,
and its associated feature ($k_{\min}$) is used for Feature Knockout
(step d).

In the Feature Knockout step, a new example is created using the class
of a randomly selected example $x_a$ and all of its feature values
except for the value at $k_{\min}$. The value for this feature is
taken from a second randomly selected example $x_b$. The new example
$x_{m+t}$ is then appended to the training set. In order to quantify
the importance of the new example in the boosting process, a weight
has to be assigned to it. The weight $w_{m+t}$ of the new example is
estimated by copying the weight of the example from which most of the
features are taken ($x_a$). Alternatively, a more precise weight can
be determined by applying the total classifier $H(x)$ to the new
example.

Input: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in \Re^n$, $y_i \in Y = \pm 1$.

Output: Composite classifier $H(x)$.

1. Initialize weights $w_i \leftarrow 1/m$.

2. for $t = 1, 2, 3, \ldots, T$:

(a) For each feature $k$, fit a regression function $f_t^{(k)}(x)$ by
weighted least squares on $y_i$ to $x_i$ with weights $w_i$,
$i = 1..m + t - 1$.

(b) Let $k_{\min}$ be the index of the feature with the minimal
associated weighted least square error.

(c) Update the classifier $H(x) \leftarrow H(x) + f_t^{(k_{\min})}(x)$.

(d) Use Feature KO to create a new example $x_{m+t}$:

Select two random indices $1 \le a, b \le m$

$x_{m+t} \leftarrow x_a$

$x_{m+t}(k_{\min}) \leftarrow x_b(k_{\min})$

$y_{m+t} \leftarrow y_a$

(e) Set new example weight to that of its source: $w_{m+t} \leftarrow w_a$

(f) Update the weights and normalize:

$w_i \leftarrow w_i e^{-y_i f_t^{(k_{\min})}(x_i)}$, $i = 1..m + t$

$w_i \leftarrow w_i / \sum_{j=1}^{m+t} w_j$

3. Output the final classifier $H(x)$


Figure 2: The GentleBoostKO Algorithm

As with any boosting procedure, each iteration ends with the update of
the weights of all examples (including the new one), and a new round
of boosting begins. This iterative process finishes when the weights
of the examples converge, or after a fixed number of iterations. In
our experiments, we stopped the boosting after 100 rounds, enough to
ensure convergence in all cases.
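The loop of Fig. 2 can be sketched compactly with NumPy. This is a
minimal reading of the figure under stated assumptions, not the
authors' implementation: the function names `fit_stump`,
`gentleboost_ko` and `predict` are made up, candidate thresholds are
taken as midpoints between consecutive sorted feature values, and the
stump search is the naive quadratic one.

```python
import numpy as np

def fit_stump(x, y, w):
    """Weighted least squares fit of f(x) = a * (x > th) + b on one feature.

    Returns (error, a, b, th). A constant fit is used as the fallback."""
    b0 = (w * y).sum() / w.sum()
    best = ((w * (y - b0) ** 2).sum(), 0.0, b0, -np.inf)
    xs = np.unique(x)
    for th in (xs[:-1] + xs[1:]) / 2:            # assumed threshold grid
        above = x > th
        w_hi, w_lo = w[above].sum(), w[~above].sum()
        if w_hi == 0 or w_lo == 0:
            continue
        b = (w[~above] * y[~above]).sum() / w_lo  # weighted mean below th
        a = (w[above] * y[above]).sum() / w_hi - b
        err = (w * (y - (a * above + b)) ** 2).sum()
        if err < best[0]:
            best = (err, a, b, th)
    return best

def gentleboost_ko(X, y, T=100, seed=0):
    """X: (m, n) array with one example per row, y: (m,) labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    X, y = X.astype(float).copy(), y.astype(float).copy()
    w = np.full(m, 1.0 / m)                                   # step 1
    stumps = []
    for _ in range(T):
        fits = [fit_stump(X[:, k], y, w) for k in range(n)]   # step (a)
        k_min = int(np.argmin([f[0] for f in fits]))          # step (b)
        _, a, b, th = fits[k_min]
        stumps.append((k_min, a, b, th))                      # step (c)
        ia, ib = rng.integers(0, m, size=2)                   # step (d)
        x_new = X[ia].copy()
        x_new[k_min] = X[ib, k_min]
        X = np.vstack([X, x_new])
        y = np.append(y, y[ia])
        w = np.append(w, w[ia])                               # step (e)
        w = w * np.exp(-y * (a * (X[:, k_min] > th) + b))     # step (f)
        w = w / w.sum()
    return stumps

def predict(stumps, X):
    """Sign of the composite classifier H(x) for each row of X."""
    H = sum(a * (X[:, k] > th) + b for k, a, b, th in stumps)
    return np.sign(H)
```

Note that the two random indices are drawn from the $m$ original
examples only, as in step (d), while the weight update in step (f)
runs over all $m + t$ examples.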

5.  Experiments 

Visual recognition using the Caltech datasets. We tested our
gentleBoostKO algorithm on several Caltech object recognition datasets
that were presented in [5]. In each experiment we had to distinguish
between images containing an object and background images that do not
contain the object. The datasets (Airplanes, Cars, Faces, Leafs and
Motorbikes), as well as the background images, were downloaded from
http://www.vision.caltech.edu/. For the experiments we used the
predefined splits (available for all the datasets except the Leafs
dataset). For Leafs, we used a random split of 50% training and 50%
testing. Note that since our methods are discriminative, we needed a
negative training set. To this end, we removed 30 random examples from
the negative testing set, and used them for training.

To turn each image into a feature vector we used 500 C2 features [11].
These extremely successful features allow us to learn to recognize
objects using few training images, and the results seem to be
comparable to or better than the results reported in [4]. The results
are shown in Fig. 3. To compare with previous work, we used the error
at the equilibrium point between false and true positives as our error
measure. It is clear that for a few dozen examples, SVM, gentleBoost
and gentleBoostKO have the same performance level. However, for only a
few training examples, gentleBoost does not perform as well as SVM,
while gentleBoostKO achieves the same level of performance.

We also tried to apply Lowe's SIFT features [8] to the same datasets,
although these features were designed for a different task. For each
image, we used Lowe's binaries to compute the SIFT description of each
key point. We then sampled from the training set 1000 random keypoints
$k_1, \ldots, k_{1000}$. Let $\{k_i^I\}$ be the set of all keypoints
associated with image $I$. We represented each training and testing
image $I$ by a vector of 1000 elements: $[v_I(1) \ldots v_I(1000)]$,
such that $v_I(j) = \min_i \|k_j - k_i^I\|$. Note that in [8] the use
of the ratio of distances between the closest and the next closest
points was encouraged (and not just the minimum distance). For our
application, which disregards all geometric information, we found that
using the minimum gives much better results. For the testing and
training splits reported in [5] we got the following results (ME=mean
error, EqE=error at equilibrium):


Algorithm        Planes   Cars    Faces   Leaves  Motor.
Lin. SVM ME      0.104    0.019   0.107   0.118   0.033
gentleB ME       0.118    0.036   0.168   0.137   0.026
gentleBKO ME     0.100    0.033   0.119   0.114   0.023
Lin. SVM EqE     0.108    0.018   0.111   0.126   0.007
gentleB EqE      0.120    0.037   0.166   0.132   0.003
gentleBKO EqE    0.111    0.030   0.136   0.120   0.008
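
The minimum-distance image representation $v_I(j) = \min_i \|k_j -
k_i^I\|$ used for the SIFT experiments reported in the table can be
sketched as follows. The sketch assumes that the descriptors are
already available as NumPy arrays; extracting them (done with Lowe's
binaries in our experiments) is outside its scope, and the function
and argument names are illustrative.

```python
import numpy as np

def min_distance_features(reference_keypoints, image_keypoints):
    """Represent one image by its distances to the sampled keypoints.

    reference_keypoints: (J, d) array, the sampled descriptors k_1..k_J.
    image_keypoints:     (N, d) array, the descriptors k_i^I of image I.
    Returns v of shape (J,) with v[j] = min_i ||k_j - k_i^I||.
    """
    # Pairwise differences, shape (J, N, d); fine for small inputs but
    # memory hungry for large J and N.
    diff = reference_keypoints[:, None, :] - image_keypoints[None, :, :]
    return np.linalg.norm(diff, axis=2).min(axis=1)
```

With 1000 sampled keypoints this yields the 1000-element vector per
image described above, which can then be passed to any of the compared
classifiers.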

Car type identification. This dataset consists of 480 images of
private cars, and 248 images of mid-sized vehicles (such as SUVs). All
images are 20 x 20 pixels, and were collected using Mobileye's car
detector on a video stream taken from the front window of a moving
car. The task is to learn to distinguish private cars from mid-sized
vehicles, which has some safety applications. Taking into account the
low resolution and the variability in the two classes, this is a
difficult task. The results are shown in the bottom right corner of
Fig. 3. Each point of the graph shows the mean error when applying the
algorithms to training sets of different size (between 5 and 40
percent of the data). The rest of the examples were used for testing.
It is evident that for this dataset gentleBoost outperforms SVM.
Still, gentleBoostKO does even better.